Take Home Exercise 3

Author

Georgia Ng

Published

October 17, 2024

Modified

October 19, 2024

1. Overview

1.1 Introduction

1.2 My Responsibilities

  • Data Preparation, Preprocessing

1.3 Importing Packages

Here, we have loaded the following packages:

pacman::p_load(sf, sfdep, tmap, tidyverse, RColorBrewer, ggplot2, spatstat)

2. The Data

For this project, we will be using the following data sets.

  • Singapore Resale Flat Prices (Jan-17 to Sep-24) from Kaggle, an accumulation of information relating to the sale of Singapore’s public housing apartments colloquially referred to as flats

    • This dataset augments the original dataset by including 4 important categories of information:

      1. X/Y lat/lng coordinates, which can be used for geospatial plotting.

      2. Information about the closest MRT station to the flat

      3. Information about the closest primary school to flat

      4. the URA planning area (or town) of the flat.

  • MP

  • HDB Property Info from data.gov.sg

2.1 Importing Geospatial Data

The code chunk below is used to import MP_SUBZONE_WEB_PL shapefile by using st_read() of sfpackages.

mpsz <- st_read(dsn = "data/geospatial/MasterPlan2014SubzoneBoundaryWebSHP/", layer = "MP14_SUBZONE_WEB_PL")
Reading layer `MP14_SUBZONE_WEB_PL' from data source 
  `/Users/georgiaxng/georgiaxng/is415-handson/Take-home_Ex/Take-home_Ex03/data/geospatial/MasterPlan2014SubzoneBoundaryWebSHP' 
  using driver `ESRI Shapefile'
Simple feature collection with 323 features and 15 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21

Here is a plot of Singapore.

plot(mpsz)

2.2 Importing Aspatial Data

2.2.1 Importing Resale Flat Prices

The code chunk below is used to import the Resale Flat Prices dataset from Kaggle.

resale_df = read_csv('data/aspatial/resale_hdb_price_for_kaggle_2024-30sep.csv')

To get a brief overview of existing columns of this dataset, we can use colnames() to do so.

colnames(resale_df)
 [1] "...1"                          "month"                        
 [3] "storey_range"                  "floor_area_sqm"               
 [5] "flat_model"                    "lease_commence_date"          
 [7] "remaining_lease"               "resale_price"                 
 [9] "floor_area_sqft"               "price_per_sqft"               
[11] "blk_no"                        "road_name"                    
[13] "building"                      "postal"                       
[15] "address"                       "lease_commence_date_r"        
[17] "planning_area_ura"             "region_ura"                   
[19] "x"                             "y"                            
[21] "latitude"                      "longitude"                    
[23] "closest_mrt_station"           "distance_to_mrt_meters"       
[25] "transport_type"                "line_color"                   
[27] "distance_to_cbd"               "closest_pri_school"           
[29] "distance_to_pri_school_meters"

2.2.1.1 CRS Adjustments

Another important step after importing the dataset is checking the coordinate system used, as seen in the result below using st_crs(), we can see that there is no CRS.

st_crs(resale_df)
Coordinate Reference System: NA

Therefore, we need to convert the longitude and latitude columns into a spatial format. Since our dataset is based in Singapore and it uses the SVY21 coordinate reference system (CRS Code: 3414), we will use the st_transform() function to perform the conversion and create the geometry column.

resale_sf <- resale_df %>%
  st_as_sf(coords = c("longitude", "latitude"), crs=4326) %>%
  st_transform(crs = 3414)

Using st_crs(), we can check and verify that the conversion is successful.

st_crs(resale_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
head(resale_df)
# A tibble: 6 × 29
   ...1 month      storey_range floor_area_sqm flat_model    lease_commence_date
  <dbl> <date>     <chr>                 <dbl> <chr>         <date>             
1     0 2017-01-01 10 TO 12                 44 Improved      1979-01-01         
2     1 2017-01-01 01 TO 03                 67 New Generati… 1978-01-01         
3     2 2017-01-01 01 TO 03                 67 New Generati… 1980-01-01         
4     3 2017-01-01 04 TO 06                 68 New Generati… 1980-01-01         
5     4 2017-01-01 01 TO 03                 67 New Generati… 1980-01-01         
6     5 2017-01-01 01 TO 03                 68 New Generati… 1981-01-01         
# ℹ 23 more variables: remaining_lease <chr>, resale_price <dbl>,
#   floor_area_sqft <dbl>, price_per_sqft <dbl>, blk_no <chr>, road_name <chr>,
#   building <chr>, postal <chr>, address <chr>, lease_commence_date_r <date>,
#   planning_area_ura <chr>, region_ura <chr>, x <dbl>, y <dbl>,
#   latitude <dbl>, longitude <dbl>, closest_mrt_station <chr>,
#   distance_to_mrt_meters <dbl>, transport_type <chr>, line_color <chr>,
#   distance_to_cbd <dbl>, closest_pri_school <chr>, …

2.2.2 Importing HDB Property Information

hdb_info_df <- read_csv('data/aspatial/HDB Property Information.csv')

2.3 Data Wrangling

2.3.1 Adding Housing Type for Flats

To assess whether the housing type influences the resale price of flats, we need to include this variable in our analysis. However, the current dataset lacks this information and thus we will be deriving it using other relevant variables. Specifically, we can use the floor area of the flats to determine the housing type, based on the guidelines provided on data.gov.sg.

  • 2 Rm (2 room HDB Flat). 1 bedroom with a built-in area of about 45 sq m or 485 sq ft.

  • 3 Rm (3 room HDB Flat). 2 bedrooms with a built-in area of about 70 sq m or 754 sq ft.

  • 4 Rm (4 room HDB Flat). 3 bedrooms with a built-in area of about 90 sq m or 969 sq ft.

  • 5 Rm (5 room HDB Flat). 3 bedrooms with a built-in area of about 110 sq m or 1,184 sq ft.

  • EA (Executive Apartment). 3/4 bedrooms with built-in area of about 150 sqm or 1,615 sqft.

  • EM (Executive Mansionette). Same as Executive apartment, except it has two levels.

  • 6 Rm (6 room HDB Flat). Jumbo flat joint by two 3 room flats

Particularly, we will just group EA, EM and 6 rm flats into one category since there is no way for us right now to differentiate just by the size of the flats.

This chunk of code below derives the housing type for each flat by using conditional statements to check if the stated floor area is closer to that of a particular housing type and add the value in the new column created using mutate().

resale_sf <- resale_sf %>%
  mutate(housing_type = case_when(
    floor_area_sqft <= 619 ~ "2 Rm",  # Closer to 485 sqft
    floor_area_sqft > 619 & floor_area_sqft <= 862 ~ "3 Rm",  # Closer to 754 sqft
    floor_area_sqft > 862 & floor_area_sqft <= 1076 ~ "4 Rm",  # Closer to 969 sqft
    floor_area_sqft > 1076 & floor_area_sqft <= 1399 ~ "5 Rm",  # Closer to 1184 sqft
    floor_area_sqft > 1399 ~ "EA/EM/6 Rm",  # Closer to 1615 sqft
  ))

2.3.2 Grouping Remaining Lease By Range

First, let us convert the lease period from chr to total months. Below, we extract the different year and month from the string and then make the calculations to convert it to total months.

# Function to convert lease period to total months
convert_to_months <- function(lease) {
  parts <- strsplit(lease, " ")[[1]]  # Split by space
  years <- as.numeric(parts[1])        # Extract years
  # Check if months are present
  if (length(parts) > 2) {
    months <- as.numeric(parts[3])       # Extract months if present
  } else {
    months <- 0                           # Set months to 0 if not present
  }
  
  total_months <- years * 12 + months  # Convert to total months
  return(total_months)
}
resale_sf <- resale_sf %>%
  mutate(remaining_lease_total_months = sapply(remaining_lease, convert_to_months))

Using the number of total months, we can then get a brief overview of the remaining lease of the resale flat including the mins and max. In this case, we can see that the minimum number of months is 497 (ard 41 years) and maximum is 1173 (around 97 years). This information will later be useful in setting the ranges for the remaining lease.

summary(resale_sf['remaining_lease_total_months'])
 remaining_lease_total_months          geometry     
 Min.   : 497.0               POINT        :190724  
 1st Qu.: 756.0               epsg:3414    :     0  
 Median : 893.0               +proj=tmer...:     0  
 Mean   : 894.4                                     
 3rd Qu.:1062.0                                     
 Max.   :1173.0                                     

Since the minimum is 41 years, our date range can start from 40-49 years onwards.

Note

E.g. 40-49 years refers to 40 years 0 months to 49 years 11 months.

# Define the ranges and labels
breaks <- c(480, 600, 720, 840, 960, 1080, 1200)
labels <- c( "40-49 years", "50-59 years", "60-69 years", "70-79 years", "80-89 years", "90-99 years")

resale_sf <- resale_sf %>%
  mutate(remaining_lease_range = cut(remaining_lease_total_months, breaks = breaks, labels = labels))

2.3.3 Removal of Redundant Columns

# Define columns to be removed
columns_to_remove <- c("floor_area_sqm", "flat_model", "lease_commence_date", "remaining_lease", "blk_no", "road_name", "building", "address", "lease_commence_date_r", "postal","...1")

# Remove columns only if they exist in the dataframe
resale_sf <- resale_sf %>%
  dplyr::select(-all_of(columns_to_remove[columns_to_remove %in% names(resale_sf)]))

2.3.4 Adjustments to Storey Ranges

Through the histogram of the storey ranges, we can observe a right skewed distribution which shows that higher storeys are of a much lower frequency among resale flats. This is likely due to the fact that most HDB flats in Singapore are lower-rise buildings.

# Create a summary of counts for each remaining lease range
count_data <- resale_sf %>%
  group_by(storey_range) %>%
  summarise(count = n())

# Create the bar plot with labels
ggplot(count_data, aes(x = storey_range, y = count)) +
  geom_bar(stat = "identity", fill = "steelblue") + 
  geom_text(aes(label = count), vjust = -0.5, size = 4) +  # Add labels on top of the bars
  labs(title = "Count of Storey Range",
       x = "Storey Range",
       y = "Count") +
  theme_minimal()

Here is a summary of the maximum floor levels of all the HDB flats in Singapore. As shown here we can observe that both the mean and median maximum floor is a mere 12.

summary(hdb_info_df$max_floor_lvl)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0     6.0    12.0    12.1    16.0    50.0 

Thus, to simplify our analysis and make the visualization easier to interpret, we can combine the higher storey ranges into a single broader range. In particular, we will group the floors 16 or higher together to form the 16+ floors range.

resale_sf <- resale_sf %>%
    mutate(storey_range_grouped = case_when(
        storey_range %in% c("01 TO 03", "04 TO 06", "07 TO 09", "10 TO 12", "13 TO 15") ~ storey_range,
        storey_range %in% c("16 TO 18", "19 TO 21", "22 TO 24", "25 TO 27", "28 TO 30", "31 TO 33", "34 TO 36", "37 TO 39", "40 TO 42", "43 TO 45", "46 TO 48", "49 TO 51") ~ "16+",
        TRUE ~ storey_range
    ))

# Plot the grouped data
ggplot(resale_sf, aes(x = storey_range_grouped)) +
    geom_bar() +
    geom_text(stat='count', aes(label=..count..), vjust=-0.5) +
    labs(title = "Count of Storey Range (Grouped)",
         x = "Storey Range",
         y = "Count") +
    theme_minimal()

2.3.5 Checking for NA values

2.3.6 Time Period Selection for Analysis

resale_5yr_sf <- resale_sf %>% filter (year(month) >= 2020)
write_rds(resale_5yr_sf,'data/rds/resale_5yr_sf.rds')

2.4 Overview Of Dataset

colnames(resale_5yr_sf)
 [1] "month"                         "storey_range"                 
 [3] "resale_price"                  "floor_area_sqft"              
 [5] "price_per_sqft"                "planning_area_ura"            
 [7] "region_ura"                    "x"                            
 [9] "y"                             "closest_mrt_station"          
[11] "distance_to_mrt_meters"        "transport_type"               
[13] "line_color"                    "distance_to_cbd"              
[15] "closest_pri_school"            "distance_to_pri_school_meters"
[17] "geometry"                      "housing_type"                 
[19] "remaining_lease_total_months"  "remaining_lease_range"        
[21] "storey_range_grouped"         

Explanatory Variables:

Continuous

  • Remaining Lease: remaining_lease_total_months

  • Size of flat: floor_area_sqft

  • Distance to transport: distance_to_mrt_meters

  • Distance to amenities: distance_to_pri_school_meters

  • Distance to central business district: distance_to_cbd

Categorical

  • Remaining Lease: remaining_lease_period

  • Storey Height: storey_range

  • Housing Type: housing_type

Dependent Variables:

  • Resale Price: resale_price, price_per_sqft

3. Shiny Storyboard

4. Distribution

4.1 Categorical Variables

4.1.1 Remaining Lease Range

# Create a summary of counts for each remaining lease range
count_data <- resale_5yr_sf %>%
  group_by(remaining_lease_range) %>%
  summarise(count = n())

# Create the bar plot with labels
ggplot(count_data, aes(x = remaining_lease_range, y = count)) +
  geom_bar(stat = "identity", fill = "steelblue") + 
  geom_text(aes(label = count), vjust = -0.5, size = 4) +  # Add labels on top of the bars
  labs(title = "Count of Remaining Lease Ranges",
       x = "Remaining Lease Range",
       y = "Count") +
  theme_minimal()

4.1.2 Storey Range

# Create a summary of counts for each remaining lease range
count_data <- resale_5yr_sf %>%
  group_by(storey_range_grouped) %>%
  summarise(count = n())

# Create the bar plot with labels
ggplot(count_data, aes(x = storey_range_grouped, y = count)) +
  geom_bar(stat = "identity", fill = "steelblue") + 
  geom_text(aes(label = count), vjust = -0.5, size = 4) +  # Add labels on top of the bars
  labs(title = "Count of Storey Range",
       x = "Storey Range",
       y = "Count") +
  theme_minimal()

4.1.3 Housing Type

# Create a summary of counts for each remaining lease range
count_data <- resale_5yr_sf %>%
  group_by(housing_type) %>%
  summarise(count = n())

# Create the bar plot with labels
ggplot(count_data, aes(x = housing_type, y = count)) +
  geom_bar(stat = "identity", fill = "steelblue") + 
  geom_text(aes(label = count), vjust = -0.5, size = 4) +  # Add labels on top of the bars
  labs(title = "Count of Housing Type",
       x = "Housing Type",
       y = "Count") +
  theme_minimal()

Bivariate Analysis

Correlation Matrix

Drafts

lease_storey_table <- table(resale_5yr_sf$storey_range_grouped, resale_5yr_sf$remaining_lease_range)
lease_storey_df <- as.data.frame(lease_storey_table)

# Heatmap with ggplot
ggplot(lease_storey_df, aes(Var1, Var2, fill = Freq)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "red") +
  labs(title = "Heatmap: Storey Range vs. Remaining Lease Range",
       x = "Storey Range",
       y = "Remaining Lease Range") +
  theme_minimal()

mpsz_main <- st_union(mpsz) %>%
    st_cast("POLYGON")
mpsz_main <- mpsz_main[c(12)]

mpsz_owin <- as.owin(mpsz_main)
plot(mpsz_owin)

resale_2020_sf <- resale_5yr_sf %>% filter (year(month) == 2020)
resale_2021_sf <- resale_5yr_sf %>% filter (year(month) == 2021)
resale_2022_sf <- resale_5yr_sf %>% filter (year(month) == 2022)
resale_2023_sf <- resale_5yr_sf %>% filter (year(month) == 2023)
resale_2024_sf <- resale_5yr_sf %>% filter (year(month) == 2024)
tmap_mode('plot')
tm_shape(mpsz%>% filter(PLN_AREA_N == 'ANG MO KIO'))+
  tm_polygons()+
tm_shape(resale_2023_sf %>% filter(planning_area_ura == 'ANG MO KIO'))+
  tm_dots()

tmap_mode('plot')
tm_shape(mpsz%>% filter(PLN_AREA_N == 'ANG MO KIO'))+
  tm_polygons()+
tm_shape(resale_2024_sf %>% filter(planning_area_ura == 'ANG MO KIO'))+
  tm_dots()

test <- resale_5yr_sf %>% filter(housing_type == 'EA/EM/6 Rm')
test
Simple feature collection with 10649 features and 20 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 11519.15 ymin: 28097.64 xmax: 42623.63 ymax: 48372.85
Projected CRS: SVY21 / Singapore TM
# A tibble: 10,649 × 21
   month      storey_range resale_price floor_area_sqft price_per_sqft
 * <date>     <chr>               <dbl>           <dbl>          <dbl>
 1 2020-01-01 01 TO 03           700000           1432.           489.
 2 2020-01-01 07 TO 09           770000           1755.           439.
 3 2020-01-01 07 TO 09           623000           1615.           386.
 4 2020-01-01 04 TO 06           615000           1421.           433.
 5 2020-01-01 07 TO 09           590000           1464.           403.
 6 2020-01-01 13 TO 15           765000           1432.           534.
 7 2020-01-01 04 TO 06           525000           1432.           367.
 8 2020-01-01 13 TO 15           758000           1539.           492.
 9 2020-01-01 07 TO 09           680000           1528.           445.
10 2020-01-01 13 TO 15           798888           1539.           519.
# ℹ 10,639 more rows
# ℹ 16 more variables: planning_area_ura <chr>, region_ura <chr>, x <dbl>,
#   y <dbl>, closest_mrt_station <chr>, distance_to_mrt_meters <dbl>,
#   transport_type <chr>, line_color <chr>, distance_to_cbd <dbl>,
#   closest_pri_school <chr>, distance_to_pri_school_meters <dbl>,
#   geometry <POINT [m]>, housing_type <chr>,
#   remaining_lease_total_months <dbl>, remaining_lease_range <fct>, …